The purpose of this notebook is to demonstrate the conversion of long-format data into wide-format. Long-format data contains one row per available alternative per choice situation. In contrast, wide-format data contains one row per choice situation. PyLogit and other software packages (e.g. mlogit in R) use data that is in long-format. However, other software packages, such as Statsmodels in Python or Python BIOGEME, use data that is in wide-format.
Because different software packages have different data format requirements, it is useful to be able to convert one's data from one format to another. Other PyLogit example notebooks (such as the "Main PyLogit Example") demonstrate how to take data from wide-format and convert it into long-format. This notebook will demonstrate the reverse process: taking data from long-format and converting it into wide-format.
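To make the two layouts concrete before turning to the real dataset, the following is a minimal sketch using plain pandas on a made-up table. The column names (`obs_id`, `alt_id`, `cost`, `income`) and all values are hypothetical, and the reshaping shown here is only an illustration of the idea, not how PyLogit performs the conversion internally.

```python
import pandas as pd

# Hypothetical long-format data: 2 choice situations x 2 alternatives,
# i.e. one row per available alternative per choice situation.
toy_long = pd.DataFrame({
    "obs_id": [1, 1, 2, 2],
    "alt_id": [1, 2, 1, 2],
    "choice": [1, 0, 0, 1],
    "cost":   [10.0, 4.0, 12.0, 5.0],   # varies across alternatives
    "income": [35, 35, 60, 60],         # constant within a choice situation
})

# Alternative specific variable: one column per alternative.
cost_wide = toy_long.pivot(index="obs_id", columns="alt_id", values="cost")
cost_wide.columns = ["cost_{}".format(alt) for alt in cost_wide.columns]

# Individual specific variable: take the single value per choice situation.
income = toy_long.groupby("obs_id")["income"].first()

# In this sketch, the wide-format choice column holds the chosen alternative's id.
chosen = (toy_long.loc[toy_long["choice"] == 1]
                  .set_index("obs_id")["alt_id"]
                  .rename("choice"))

# One row per choice situation.
toy_wide = pd.concat([chosen, income, cost_wide], axis=1)
toy_wide.index.name = "obs_id"
toy_wide = toy_wide.reset_index()
print(toy_wide)
```

The four long-format rows collapse into two wide-format rows, with the cost of each alternative stored in its own column.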
The dataset being used in this example is the "Travel Mode Choice" dataset from Greene and Hensher. It is described on the statsmodels website, and their description is reproduced below in full.
The data, collected as part of a 1987 intercity mode choice study, are a sub-sample of 210 non-business trips between Sydney, Canberra and Melbourne in which the traveler chooses a mode from four alternatives (plane, car, bus and train). The sample, 840 observations, is choice based with over-sampling of the less popular modes (plane, train and bus) and under-sampling of the more popular mode, car. The level of service data was derived from highway and transport networks in Sydney, Melbourne, non-metropolitan N.S.W. and Victoria, including the Australian Capital Territory.

Number of observations: 840 Observations On 4 Modes for 210 Individuals.
Number of variables: 8
Variable name definitions::

    individual = 1 to 210
    mode =
        1 - air
        2 - train
        3 - bus
        4 - car
    choice =
        0 - no
        1 - yes
    ttme = terminal waiting time for plane, train and bus (minutes); 0 for car.
    invc = in vehicle cost for all stages (dollars).
    invt = travel time (in-vehicle time) for all stages (minutes).
    gc = generalized cost measure: invc + (invt * value of travel time savings) (dollars).
    hinc = household income ($1000s).
    psize = traveling group size in mode chosen (number).

Source

    Greene, W.H. and D. Hensher (1997) Multinomial logit and discrete choice models in Greene, W. H. (1997) LIMDEP version 7.0 user’s manual revised, Plainview, New York econometric software, Inc.
    Download from on-line complements to Greene, W.H. (2011) Econometric Analysis, Prentice Hall, 7th Edition (data table F18-2)
    http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF18-2.csv
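The structure described above (one row per alternative per individual, with exactly one chosen alternative) can be checked before any conversion. The helper below is a hypothetical sketch written for this notebook, not part of PyLogit or statsmodels, and it is demonstrated on made-up data.

```python
import pandas as pd

def check_long_format(df, obs_id_col, alt_id_col, choice_col):
    """Hypothetical helper: basic diagnostics for a long-format dataframe."""
    # Total number of chosen alternatives in each choice situation.
    chosen_per_obs = df.groupby(obs_id_col)[choice_col].sum()
    # A given alternative should appear at most once per choice situation.
    has_duplicates = df.duplicated(subset=[obs_id_col, alt_id_col]).any()
    return {
        "n_rows": len(df),
        "n_obs": df[obs_id_col].nunique(),
        "n_alts": df[alt_id_col].nunique(),
        "one_choice_per_obs": bool((chosen_per_obs == 1).all()),
        "no_duplicate_rows": not bool(has_duplicates),
    }

# Made-up example: 2 choice situations, 2 alternatives each.
toy_long = pd.DataFrame({
    "obs_id": [1, 1, 2, 2],
    "alt_id": [1, 2, 1, 2],
    "choice": [1, 0, 0, 1],
})
report = check_long_format(toy_long, "obs_id", "alt_id", "choice")
print(report)
```

For the Travel Mode Choice data, one would pass the "individual", "mode", and "choice" columns; based on the description above, the report should show 840 rows, 210 choice situations, and 4 alternatives.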
In [1]:
# To access the Travel Mode Choice data
import statsmodels.datasets
# To perform the dataset conversion
import pylogit as pl
In [3]:
# Access the dataset
mode_data = statsmodels.datasets.modechoice.load_pandas()
# Get a pandas dataframe of the mode choice data
long_df = mode_data["data"]
# Look at the dataframe to ensure that it loaded correctly
long_df.head()
Out[3]:
The function in PyLogit that is used to convert long-format data to wide-format data is "convert_long_to_wide", and it can be accessed through "pl.convert_long_to_wide". The docstring for the function contains all of the information necessary to perform the conversion, and readers can view it at their leisure (for example, with "help(pl.convert_long_to_wide)"). For now, we will simply create the needed objects/arguments for the function.
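In particular, we will need the following 7 objects:
1. ind_vars
2. alt_specific_vars
3. subset_specific_vars
4. obs_id_col
5. alt_id_col
6. choice_col
7. alt_name_dict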
The cells below will show exactly what these objects are.
In [10]:
# ind_vars is a list of strings denoting the column
# headings of data that varies across choice situations,
# but not across alternatives. In our data, this is
# the household income and party size.
individual_specific_variables = ["hinc", "psize"]
# alt_specific_vars is a list of strings denoting the
# column headings of data that vary not only across
# choice situations but also across all alternatives.
# These are columns such as the "level of service"
# variables.
alternative_specific_variables = ["invc", "invt", "gc"]
# subset_specific_vars is a dictionary. Each key is a
# string that denotes a variable that is subset specific.
# Each value is a list of alternative ids, over which the
# variable actually varies. Note that subset specific
# variables vary across choice situations and across some
# (but not all) alternatives. This is most common when
# using variables that are not meaningfully defined for
# all alternatives. An example of this in our dataset is
# terminal time ("ttme"). This variable is not meaningfully
# defined for the "car" alternative. Therefore, it is always
# zero for "car". Note that "4", the id of the "car" alternative,
# is omitted from the list of alternative ids below.
subset_specific_variables = {"ttme": [1, 2, 3]}
# obs_id_col is the column denoting the id of the choice
# situation. If one were using a panel dataset, with multiple
# choice situations per unit of observation, the column
# denoting the unit of observation would be listed in
# ind_vars (i.e. with the individual specific variables)
observation_id_column = "individual"
# alt_id_col is the column denoting the id of the alternative
# corresponding to a given row.
alternative_id_column = "mode"
# choice_col is the column denoting whether the alternative
# on a given row was chosen in the corresponding choice situation
choice_column = "choice"
# Lastly, alt_name_dict is not necessary. However, it is useful.
# It records the names corresponding to each alternative, if there
# are any, and allows for the creation of meaningful column names
# in the wide-format data (such as when creating the columns
# denoting the available alternatives in each choice situation).
# The keys of alt_name_dict are the unique alternative ids, and
# the values are the names of each alternative.
alternative_name_dict = {1: "air",
                         2: "train",
                         3: "bus",
                         4: "car"}
In [12]:
# Finally, we can create the wide format dataframe
wide_df = pl.convert_long_to_wide(long_df,
                                  individual_specific_variables,
                                  alternative_specific_variables,
                                  subset_specific_variables,
                                  observation_id_column,
                                  alternative_id_column,
                                  choice_column,
                                  alternative_name_dict)
# Let's look at the created dataframe, transposed for easy viewing
wide_df.head().T
Out[12]:
As we can see above, PyLogit does a few things automatically. First, using the names provided in alt_name_dict, it adds suffixes to the alternative specific variables and the subset specific variables. These suffixes record which alternative a given column of data refers to. Second, when dealing with subset specific variables, PyLogit only creates columns of data for the alternatives over which the variable actually varies. Lastly, PyLogit automatically creates columns that denote the availability of each alternative in each choice situation. These columns are suffixed to denote the alternatives that they correspond to, and they are inferred automatically from the rows present in the long-format data.
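The idea of inferring availability from the rows that are present can be illustrated with plain pandas. This is a sketch on made-up data, not PyLogit's implementation; the column names (`obs_id`, `alt_id`) and the "availability_" prefix are assumptions chosen for illustration and may not match the names PyLogit produces.

```python
import pandas as pd

# Hypothetical long-format data in which alternative 2 ("train")
# has no row, and is therefore unavailable, in choice situation 2.
toy_long = pd.DataFrame({
    "obs_id": [1, 1, 2],
    "alt_id": [1, 2, 1],
    "choice": [0, 1, 1],
})
alt_names = {1: "air", 2: "train"}

# Count the rows for each (choice situation, alternative) pair. With at
# most one row per pair, the count is 1 if available and 0 otherwise.
availability = pd.crosstab(toy_long["obs_id"], toy_long["alt_id"])

# Use the alternative names as column suffixes.
availability.columns = ["availability_{}".format(alt_names[alt])
                        for alt in availability.columns]
print(availability)
```

Choice situation 1 has both alternatives available, whereas choice situation 2 has a zero in the "train" column because no such row exists in the long-format data.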
There is also a "null_value" keyword argument that one can use in the conversion function. It is useful when one has alternative specific variables and not all alternatives are available in all choice situations. In this setting, one may want to specify the value used to fill in the missing data, such as null, -999, etc. The "null_value" keyword argument allows one to do this.
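To show why such a fill value is needed, the sketch below reshapes a made-up long-format table in which one alternative is unavailable in one choice situation. It uses plain pandas rather than PyLogit, and the column names and the -999 sentinel are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: alternative 2 is unavailable in choice situation 2,
# so there is no cost value to place in that wide-format cell.
toy_long = pd.DataFrame({
    "obs_id": [1, 1, 2],
    "alt_id": [1, 2, 1],
    "cost":   [10.0, 4.0, 12.0],
})

# Reshaping leaves the cell for the missing alternative as NaN.
cost_wide = toy_long.pivot(index="obs_id", columns="alt_id", values="cost")
cost_wide.columns = ["cost_{}".format(alt) for alt in cost_wide.columns]

# Replace the missing cells with an explicit sentinel value instead.
cost_filled = cost_wide.fillna(-999)
print(cost_filled)
```

With PyLogit, the analogous step would be to pass the desired fill value through the "null_value" keyword argument of pl.convert_long_to_wide, as described above.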